Abstract. Distance-based methods such as UPGMA (Unweighted Pair Group Method
with Arithmetic Mean) continue to play a significant role in phylogenetic research. We
use polyhedral combinatorics to analyze the natural subdivision of the positive orthant
induced by classifying the input vectors according to tree topologies returned by the
algorithm. The partition lattice informs the study of UPGMA trees. We give a closed
form for the extreme rays of UPGMA cones on n taxa, and compute the normalized
volumes of the UPGMA cones for small n.

1. Introduction 

The UPGMA algorithm (Unweighted Pair Group Method with Arithmetic Mean) [0] is
an agglomerative tree reconstruction method that takes as input $\binom{n}{2}$ pairwise distances
(dissimilarities) between n taxa and returns a rooted, equidistant tree with these n taxa
as the leaves. UPGMA is a greedy heuristic that attempts to compute the Euclidean
projection onto the space of all equidistant tree metrics [3]. The UPGMA algorithm
subdivides the positive orthant $\mathbb{R}^{n(n-1)/2}_{\geq 0}$ into regions based on which combinatorial type
of tree is returned by the algorithm. The goal of this paper is to study the geometry of
these regions in order to understand both how the regions relate to one another and
the performance of the algorithm.

UPGMA has poor performance if the data is tree-like but does not follow a molecular 
clock. In spite of this limitation, we find UPGMA an interesting algorithm to study 
because it is one of the few phylogenetic reconstruction methods that directly returns a 
rooted tree on a collection of species. One motivation for studying the UPGMA algorithm 
was the work of Aldous [1], where it was observed that rooted trees that have been
constructed from data do not typically have the same underlying statistics as familiar
speciation models such as the Yule process. This raises the question of whether
the Yule process is flawed, or the trees that have been constructed are biased because of
taxa selection, or inherent bias in the reconstruction methods. We believe that analyzing 
the partition of data space induced by a tree reconstruction method can give some insight 
into the latter problem: if regions corresponding to some tree shapes are inherently larger 
than others, this indicates that the algorithm might favor those shapes in the presence of 
noise or model misspecification of the equidistant assumption. 

With these motivating problems in mind, we study the decomposition of space induced
by the UPGMA algorithm. For a given binary phylogenetic X-tree T (that is, with
leaf labels X but without edge lengths), the region $V(T) \subseteq \mathbb{R}^{n(n-1)/2}_{\geq 0}$ of dissimilarity
maps for which the algorithm returns the phylogenetic X-tree T is a union of finitely many
polyhedral cones, one for each ranking function of the interior nodes of T. We give explicit
polyhedral descriptions of the cones, including facet-defining inequalities and extreme rays,
for all T and all n. In particular, each cone has $O(n^3)$ facet-defining inequalities but
exponentially many extreme rays. We compute the spherical volumes of the regions V(T)
for $n \leq 7$. These volumes give a measure of the proportion of dissimilarity maps for which
UPGMA returns a given combinatorial type of tree. In particular, our computations seem
to indicate that highly unbalanced trees have small volume UPGMA cones compared to
more balanced trees. Our computation of spherical volumes builds on the Monte Carlo
strategy in [3].

2. Ranked Phylogenetic Trees and the UPGMA Algorithm 

The UPGMA method is an agglomerative tree reconstruction method that takes as
input $\binom{n}{2}$ pairwise distances between a set X of n taxa and returns a rooted equidistant
tree metric on X. In this section, we review necessary background on ranked phylogenetic 
trees and the lattice of set partitions as they pertain to describing the UPGMA algorithm. 
We refer the reader to [5] and [5] for background on phylogenetics. 

Definition 2.1. Let X be a finite set. A phylogenetic X-tree is a tree T with leaves
bijectively labeled by the set X. A phylogenetic X-tree is rooted if it has a distinguished
root node $\rho$. It is binary if every interior vertex has degree 3 except for
the root $\rho$, which has degree 2.

Throughout this paper, unless stated otherwise, we assume that a tree T on n taxa is
a rooted binary phylogenetic X-tree where X = [n]. In a rooted binary phylogenetic
X-tree, $\rho$ is not labeled by an element of X.

A vertex $v$ of $T$ is a descendant of a vertex $u$ of $T$ if the path from $\rho$ to $v$ includes $u$. This
relation induces a partial order on the vertices of $T$, and we write $u \leq_T v$. Let $V^\circ$
denote the set of interior (i.e. nonleaf) vertices of $T$. A rank function on $T$ is a bijection
$r : V^\circ \to \{1, 2, \ldots, |V^\circ|\}$ satisfying $u <_T v \implies r(u) < r(v)$. The number of rank functions
on $T$ is $|V^\circ|! / \prod_{v \in V^\circ} |de(v)|$, where $de(v)$ denotes the set of descendants of $v$ in the set
$V^\circ$ [TP]. Note that $v \leq_T v$, so that the number of descendants of $v$ will include $v$ itself. A
tree with a rank function is called a ranked phylogenetic tree.
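
The counting formula for rank functions is easy to check on small trees. The following is a minimal sketch, assuming a tree is encoded as nested 2-tuples with integer leaves; the encoding and the function names are illustrative choices, not part of the paper.

```python
from math import factorial

def interior_sizes(tree):
    """Return (k, sizes): k is the number of interior vertices in `tree`,
    and sizes lists |de(v)| for every interior vertex v of `tree`.
    A leaf is an int; an interior vertex is a 2-tuple of subtrees."""
    if not isinstance(tree, tuple):
        return 0, []
    total, sizes = 1, []          # count the vertex itself, since v <=_T v
    for child in tree:
        count, child_sizes = interior_sizes(child)
        total += count
        sizes += child_sizes
    sizes.append(total)
    return total, sizes

def count_rank_functions(tree):
    """Evaluate |V°|! divided by the product of |de(v)| over interior v."""
    total, sizes = interior_sizes(tree)
    denominator = 1
    for size in sizes:
        denominator *= size
    return factorial(total) // denominator

print(count_rank_functions((((1, 2), 3), (4, 5))))
```

For the tree (((12)3)(45)) the interior vertices have 1, 2, 1 and 4 interior descendants, so the formula evaluates to 4!/(4 · 2 · 1 · 1) = 3.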

The lattice of set partitions provides a useful alternate description of ranked phylogenetic
trees. See [10] for background and terminology for the theory of partially ordered
sets. Let $\Pi_n$ consist of all partitions of a set with n elements. For simplicity, we identify
this underlying set as $[n] = \{1, 2, \ldots, n\}$. Partitions are unordered, and consist of
unordered parts. The shorthand $\lambda_1|\cdots|\lambda_k$ denotes a partition with k parts. For
example, 12|345 is shorthand for the partition {{1,2}, {3,4,5}}.

Partitions in $\Pi_n$ are ordered by refinement, so $\lambda_1|\cdots|\lambda_k \leq \mu_1|\cdots|\mu_\ell$ if and only if for
each $i \in [k]$ there exists a $j \in [\ell]$ satisfying $\lambda_i \subseteq \mu_j$. Every maximal chain in the lattice
of set partitions corresponds to a ranked phylogenetic tree. Indeed, consider a maximal
chain

$$C = 1|2|\cdots|n = \pi_n \lessdot \pi_{n-1} \lessdot \cdots \lessdot \pi_2 \lessdot \pi_1 = 12 \cdots n$$

in $\Pi_n$. We use $\lessdot$ to denote a covering relation in the partial order $\Pi_n$, and we use the
convention that $\pi_i$ is always a partition with $i$ parts.

Given $\pi_i \in C$, we write $\pi_i = \lambda^i_1|\lambda^i_2|\cdots|\lambda^i_i$. When $\pi_i \lessdot \pi_{i-1}$, there are exactly two blocks
$\lambda^i_j, \lambda^i_k$ that are joined in $\pi_{i-1}$ but distinct in $\pi_i$. If $v \in V^\circ$ where $r(v) = n - i$, then $\pi_{i-1}$
joins the two blocks in $\pi_i$ that correspond to the subtrees of T induced by the child nodes
of $v$.

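The UPGMA algorithm constructs a rooted ranked phylogenetic X-tree from a dissimilarity
map $d$, as well as an equidistant tree metric $\delta$ which approximates $d$. The algorithm
works as follows:

Algorithm 2.2 (UPGMA Algorithm).

• Input: a dissimilarity map $d \in \mathbb{R}^{n(n-1)/2}_{\geq 0}$ on X.

• Output: a maximal chain C in the partition lattice $\Pi_n$ and an equidistant tree
metric $\delta$.

• Initialize $\pi_n = 1|2|\cdots|n$, and set $d^n = d$.

• For $i = n-1, \ldots, 1$ do

- From the partition $\pi_{i+1} = \lambda^{i+1}_1|\cdots|\lambda^{i+1}_{i+1}$ and distance vector $d^{i+1} \in \mathbb{R}^{(i+1)i/2}_{\geq 0}$,
choose $j, k$ so that $d^{i+1}(\lambda^{i+1}_j, \lambda^{i+1}_k)$ is minimized.

- Set $\pi_i$ to be the partition obtained from $\pi_{i+1}$ by merging $\lambda^{i+1}_j$ and $\lambda^{i+1}_k$ and
leaving all other parts the same. Let $\lambda^i_i = \lambda^{i+1}_j \cup \lambda^{i+1}_k$.

- Create a new distance $d^i \in \mathbb{R}^{i(i-1)/2}_{\geq 0}$ by $d^i(\lambda, \lambda') = d^{i+1}(\lambda, \lambda')$ if $\lambda, \lambda'$ are both
parts of $\pi_{i+1}$, and

$$d^i(\lambda, \lambda^i_i) = \frac{|\lambda^{i+1}_j|}{|\lambda^i_i|} d^{i+1}(\lambda, \lambda^{i+1}_j) + \frac{|\lambda^{i+1}_k|}{|\lambda^i_i|} d^{i+1}(\lambda, \lambda^{i+1}_k)$$

otherwise.

- For each $x \in \lambda^{i+1}_j$ and $y \in \lambda^{i+1}_k$ set $\delta(x, y) = d^{i+1}(\lambda^{i+1}_j, \lambda^{i+1}_k)$.

• Return: Chain $C = \pi_n \lessdot \cdots \lessdot \pi_1$ and equidistant metric $\delta$.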
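
The steps of Algorithm 2.2 can be sketched in a few lines of Python. This is an illustrative implementation, not the authors' software: the function name `upgma`, the lexicographic ordering of the input coordinates (12, 13, ..., as in Example 2.3), and the tie-breaking rule (the first minimal pair found) are all assumptions made here.

```python
from itertools import combinations

def upgma(n, vec):
    """Run UPGMA on taxa 1..n. `vec` lists d(x, y) for pairs in lexicographic order.
    Returns (chain, delta): the list of partitions pi_n, ..., pi_1 and the
    equidistant metric as a dict keyed by frozenset({x, y})."""
    blocks = [frozenset([x]) for x in range(1, n + 1)]
    # Distances between blocks, keyed by the unordered pair of blocks.
    dist = {frozenset((frozenset([x]), frozenset([y]))): value
            for (x, y), value in zip(combinations(range(1, n + 1), 2), vec)}
    chain = [frozenset(blocks)]
    delta = {}
    while len(blocks) > 1:
        # Choose the closest pair of blocks (ties: first pair found, an assumption).
        a, b = min(combinations(blocks, 2), key=lambda pair: dist[frozenset(pair)])
        height = dist[frozenset((a, b))]
        for x in a:
            for y in b:
                delta[frozenset((x, y))] = height
        merged = a | b
        rest = [c for c in blocks if c != a and c != b]
        # Weighted average update of the distances to the merged block.
        for c in rest:
            dist[frozenset((c, merged))] = (
                len(a) * dist[frozenset((c, a))] + len(b) * dist[frozenset((c, b))]
            ) / len(merged)
        blocks = rest + [merged]
        chain.append(frozenset(blocks))
    return chain, delta

chain, delta = upgma(5, [1, 2, 1.8, 1.7, 2, 2.6, 3.1, 2.4, 2.6, 1.2])
for partition in chain:
    print(sorted(sorted(block) for block in partition))
```

Blocks are stored as frozensets so that a pair of blocks can itself serve as a dictionary key; a production implementation would more likely use a distance matrix.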

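Note that in the step which recalculates distances, the weighted average

$$\frac{|\lambda^{i+1}_j|}{|\lambda^i_i|} d^{i+1}(\lambda, \lambda^{i+1}_j) + \frac{|\lambda^{i+1}_k|}{|\lambda^i_i|} d^{i+1}(\lambda, \lambda^{i+1}_k)$$

is used to determine the new distance. This is simply a computationally efficient strategy
to compute the average of distances

$$(1) \qquad d^i(\lambda, \lambda') = \frac{1}{|\lambda||\lambda'|} \sum_{x \in \lambda,\, y \in \lambda'} d(x, y),$$

a formula we will make use of later.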
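
As a quick numerical check that the recursive update agrees with the plain average over leaf pairs, the sketch below evaluates the average directly on the dissimilarity map used in Example 2.3. The helper name `average_distance` and the dictionary encoding of $d$ are assumptions made for illustration.

```python
from itertools import combinations, product

def average_distance(d, block_a, block_b):
    """Average of d(x, y) over x in block_a and y in block_b."""
    total = sum(d[frozenset((x, y))] for x, y in product(block_a, block_b))
    return total / (len(block_a) * len(block_b))

values = [1, 2, 1.8, 1.7, 2, 2.6, 3.1, 2.4, 2.6, 1.2]
d = {frozenset(pair): value
     for pair, value in zip(combinations(range(1, 6), 2), values)}

# Distance between the blocks 123 and 45, computed directly from the leaves.
print(round(average_distance(d, {1, 2, 3}, {4, 5}), 3))
```

The six leaf distances between 123 and 45 sum to 14.2, so the average is 14.2/6, which matches the value obtained from the weighted-average update, (2/3)(2.3) + (1/3)(2.5).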



Example 2.3. Let $d = (1, 2, 1.8, 1.7, 2, 2.6, 3.1, 2.4, 2.6, 1.2) \in \mathbb{R}^{5(5-1)/2}_{\geq 0}$ be a dissimilarity
map on 5 taxa.

The UPGMA algorithm performs the following steps, where an underline is used to 
denote the smallest value in the present metric. 

12   13   14   15   23   24   25   34   35   45
($\underline{1}$, 2, 1.8, 1.7, 2, 2.6, 3.1, 2.4, 2.6, 1.2)

12,3   12,4   12,5   34   35   45
(2, 2.2, 2.4, 2.4, 2.6, $\underline{1.2}$)



RUTH DAVIDSON AND SETH SULLIVANT 

12,3   12,45   3,45
($\underline{2}$, 2.3, 2.5)

123,45
($\underline{2.367}$)

where $2.367 \approx \tfrac{2}{3}(2.3) + \tfrac{1}{3}(2.5)$.

The resulting rooted metric tree produced by the UPGMA algorithm is displayed in
Figure 1.

Figure 1. The tree metric $\delta$

The corresponding chain in the lattice of partitions $\Pi_5$ is

$$C = 1|2|3|4|5 \lessdot 3|4|5|12 \lessdot 3|12|45 \lessdot 45|123 \lessdot 12345.$$

3. UPGMA Regions and UPGMA Cones

The UPGMA algorithm takes as input a dissimilarity map $d \in \mathbb{R}^{n(n-1)/2}_{\geq 0}$ and returns a
rooted equidistant tree metric. If we ignore the resulting metric tree that is output, and
only record the rooted tree computed at each step of the algorithm, the UPGMA algorithm
produces a rooted tree and a ranking function of the internal nodes corresponding to
precisely one maximal chain in the partition lattice. Our goal is to understand the set
of dissimilarity maps $d$ such that UPGMA returns a rooted tree T, or equivalently
a given chain C in the partition lattice $\Pi_n$. For a given leaf-labeled rooted tree T, let
$V(T) \subseteq \mathbb{R}^{n(n-1)/2}_{\geq 0}$ denote the closure of the set of dissimilarity maps such that the UPGMA
algorithm returns T. The set V(T) is called the UPGMA region associated to the tree T.
Similarly, for a maximal chain C in $\Pi_n$, let $V(C) \subseteq \mathbb{R}^{n(n-1)/2}_{\geq 0}$ denote the closure of the
set of dissimilarity maps such that the UPGMA algorithm returns the chain C.

Our goal in this section is to describe the sets V(T) and V(C). Clearly $V(T) = \bigcup V(C)$,
where the union is over all maximal chains in $\Pi_n$ whose associated tree is T.

Theorem 3.1. For each chain C in $\Pi_n$ the set V(C) is a pointed polyhedral cone. The
cone has $O(n^3)$ facet-defining inequalities, and exponentially many extreme rays. Each
covering relation in the chain C determines a collection of facet-defining inequalities for
V(C). Each element of the chain C determines a collection of extreme rays of V(C).

We refer the reader to [11] for background material on polyhedral geometry. To prove
Theorem 3.1 we will provide a more general result for the description of cones associated
to partial chains. A partial chain C is a sequence

$$\pi_s \lessdot \pi_{s-1} \lessdot \cdots \lessdot \pi_t$$

for some $n \geq s \geq t \geq 1$. The fact that these are covering relations guarantees that at
each step $\pi_{i+1} \lessdot \pi_i$ we are simply joining a pair of parts together. This means that any
partial chain C can be intermediate information that is calculated between steps s and t
of the UPGMA algorithm.

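For a partial chain C, let V(C) denote the set of all dissimilarity maps $d \in \mathbb{R}^{s(s-1)/2}_{\geq 0}$
for which the UPGMA algorithm could produce C on steps s through t of the algorithm. The
coordinates in the space $\mathbb{R}^{s(s-1)/2}$ are the $s(s-1)/2$ distances $d(\lambda^s_j, \lambda^s_k)$.

Proposition 3.2. Let C be a partial chain in $\Pi_n$. Let $V(C) \subseteq \mathbb{R}^{s(s-1)/2}$ be the set of
dissimilarity maps for which steps s through t of the UPGMA algorithm return the partial
chain C. For each covering relation $\pi_i \lessdot \pi_{i-1}$ let $\lambda^i_{j(i)}$ and $\lambda^i_{k(i)}$ be the pair of parts of
$\pi_i$ that are joined in $\pi_{i-1}$. Then V(C) is the solution to the following system of linear
inequalities:

$$d(\lambda^s_j, \lambda^s_k) \geq 0 \quad \text{for all } j, k,$$

and, for $i = s, \ldots, t+1$ and for all pairs $(j, k) \neq (j(i), k(i))$,

$$\frac{1}{|\lambda^i_{j(i)}||\lambda^i_{k(i)}|} \sum_{\lambda^s_{j'} \subseteq \lambda^i_{j(i)},\ \lambda^s_{k'} \subseteq \lambda^i_{k(i)}} |\lambda^s_{j'}||\lambda^s_{k'}|\, d(\lambda^s_{j'}, \lambda^s_{k'}) \;\leq\; \frac{1}{|\lambda^i_j||\lambda^i_k|} \sum_{\lambda^s_{j'} \subseteq \lambda^i_j,\ \lambda^s_{k'} \subseteq \lambda^i_k} |\lambda^s_{j'}||\lambda^s_{k'}|\, d(\lambda^s_{j'}, \lambda^s_{k'}).$$

Note that if $s > t$ we only need the nonnegativity constraint $d(\lambda^s_{j(s)}, \lambda^s_{k(s)}) \geq 0$, as the
other inequalities $d(\lambda^s_j, \lambda^s_k) \geq 0$ follow from $d(\lambda^s_{j(s)}, \lambda^s_{k(s)}) \leq d(\lambda^s_j, \lambda^s_k)$.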

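Proof. At step i of the UPGMA algorithm, we choose the pair $\lambda^i_{j(i)}$ and $\lambda^i_{k(i)}$ to merge
such that $d^i(\lambda^i_{j(i)}, \lambda^i_{k(i)})$ is minimized. Using the formula

$$d^i(\lambda^i_j, \lambda^i_k) = \frac{1}{|\lambda^i_j||\lambda^i_k|} \sum_{x \in \lambda^i_j,\ y \in \lambda^i_k} d(x, y)$$

twice shows that

$$d^i(\lambda^i_j, \lambda^i_k) = \frac{1}{|\lambda^i_j||\lambda^i_k|} \sum_{\lambda^s_{j'} \subseteq \lambda^i_j,\ \lambda^s_{k'} \subseteq \lambda^i_k} |\lambda^s_{j'}||\lambda^s_{k'}|\, d(\lambda^s_{j'}, \lambda^s_{k'}).$$

This yields precisely the inequalities in the statement of the proposition at step i. □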

Proposition 3.3. Given a maximal chain C in $\Pi_n$, there are $O(n^3)$ facet-defining
inequalities for V(C).

Proof. At step t, there are $\binom{t}{2}$ ways to merge two blocks of $\pi_t$, and the pair of parts
$\lambda^t_{j(t)}, \lambda^t_{k(t)}$ merged at step t can be paired with $\binom{t}{2} - 1$ other pairs of parts. So $\binom{t}{2} - 1$
new inequalities are introduced at step t. An elementary identity for binomial coefficients
tells us that for $a \geq b \geq 0$, $\sum_{r=b}^{a} \binom{r}{b} = \binom{a+1}{b+1}$. Thus there are

$$\sum_{t=2}^{n} \left( \binom{t}{2} - 1 \right) = \binom{n+1}{3} - (n-1) = O(n^3)$$

facet-defining inequalities. □
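
The count in the proof of Proposition 3.3 can be checked numerically. The sketch below sums the $\binom{t}{2} - 1$ comparisons contributed by each step and compares the total with the closed form that the binomial identity gives; the function name and the summation range $t = 2, \ldots, n$ are assumptions made for this illustration.

```python
from math import comb

def count_step_inequalities(n):
    """Sum, over the partitions pi_t with t = 2..n parts, the number of
    pairs of blocks that the merged pair is compared against."""
    return sum(comb(t, 2) - 1 for t in range(2, n + 1))

for n in range(2, 9):
    closed_form = comb(n + 1, 3) - (n - 1)
    print(n, count_step_inequalities(n), closed_form)
```

Both columns agree and grow like $n^3/6$, which is the cubic bound stated in the proposition.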

Now we provide a description of the extreme rays of the cones of partial chains V(C),
for partial chains starting with the bottom element $\pi_n = 1|2|\cdots|n$; we call such partial
chains grounded. The polyhedral description of the cones V(C) for more general partial
chains is used in the proof of the main cases of interest.

Definition 3.4. Given a partition $\pi_k = \lambda_1|\lambda_2|\cdots|\lambda_k \in \Pi_n$, a traversal of $\pi_k$ is a subset
$F \subseteq \binom{[n]}{2}$ of size $\binom{k}{2}$, where each element of F is a pair $\{p, p'\}$ satisfying $p \in \lambda$, $p' \in \lambda'$
for distinct parts $\lambda, \lambda'$ of $\pi_k$. There is precisely one such pair $p, p'$ for every pair of parts $\lambda, \lambda'$ of $\pi_k$.

For example, the partition 12|3|45 has $2^2 \cdot (2 \cdot 1) \cdot (2 \cdot 1) = 16$ traversals.

Definition 3.5. Let $\pi_k = \lambda_1|\lambda_2|\cdots|\lambda_k \in \Pi_n$. Let F be a traversal of $\pi_k$. The induced
vector of F, denoted $v(F)$, is the vector in $\mathbb{R}^{\binom{n}{2}}$ such that

(1) $v(F)_{ij} = 0$ if the pair $i, j$ is not in the traversal F.

(2) If $\{i, j\} \in F$, then $v(F)_{ij} = |\lambda_{k(i)}||\lambda_{k(j)}|$, where $i \in \lambda_{k(i)}$ and $j \in \lambda_{k(j)}$.

Consider the traversal {{1, 3}, {1, 4}, {3, 5}} of the partition 12|3|45. This traversal
induces the vector (0, 2, 4, 0, 0, 0, 0, 0, 2, 0).
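
Traversals and their induced vectors are straightforward to enumerate for small partitions. The following is a minimal sketch, assuming partitions are given as lists of Python sets and that coordinates are ordered lexicographically (12, 13, ..., 45); the function names are hypothetical.

```python
from itertools import combinations, product

def traversals(partition):
    """All traversals: one pair (x, y) with x < y for every pair of distinct blocks."""
    blocks = [sorted(block) for block in partition]
    choices = [[(min(x, y), max(x, y)) for x in a for y in b]
               for a, b in combinations(blocks, 2)]
    return [frozenset(choice) for choice in product(*choices)]

def induced_vector(partition, traversal, n):
    """Entry ij equals |block of i| * |block of j| if (i, j) is in the traversal, else 0."""
    size = {x: len(block) for block in partition for x in block}
    return [size[i] * size[j] if (i, j) in traversal else 0
            for i, j in combinations(range(1, n + 1), 2)]

partition = [{1, 2}, {3}, {4, 5}]
print(len(traversals(partition)))
print(induced_vector(partition, {(1, 3), (1, 4), (3, 5)}, 5))
```

On the partition 12|3|45 this reproduces the count of 16 traversals and the induced vector (0, 2, 4, 0, 0, 0, 0, 0, 2, 0) from the example above.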

Theorem 3.6. Let $C = \pi_n \lessdot \pi_{n-1} \lessdot \cdots \lessdot \pi_t$ be a grounded partial chain in $\Pi_n$. Then
V(C) is a cone with extreme rays given by the set of vectors

$$\{e(k, l) : k, l \text{ are not in the same part of the partition } \pi_t\} \;\cup\; \bigcup_{i=t+1}^{n} \{v(F) : F \text{ is a traversal of } \pi_i\}.$$

Note that $e(k, l)$ denotes the standard unit vector in $\mathbb{R}^{n(n-1)/2}$ with a 1 in the $k, l$
position and a 0 elsewhere.

The remainder of this section consists of the proof of Theorem 3.6 and completes our
description of the cones V(C). The proof will be broken into a number of pieces, and will
work by induction on both t and n.

Let $\mathbf{1}^i$ denote the vector in $\mathbb{R}^{i(i-1)/2}$ all of whose coordinates are equal to one. Note
that $\mathbf{1}^n$ is the induced vector of the single traversal associated to the partition $1|2|\cdots|n$,
which appears in every grounded partial chain.

Lemma 3.7. Let $C = \pi_s \lessdot \cdots \lessdot \pi_t$ be a partial chain in $\Pi_n$ with $s > t$. Then

(1) $\mathbf{1}^s$ is an extreme ray of V(C), and

(2) $\mathbf{1}^s$ is the only extreme ray of V(C) that has a nonzero $(\lambda^s_{j(s)}, \lambda^s_{k(s)})$ coordinate, where
$(\lambda^s_{j(s)}, \lambda^s_{k(s)})$ is the pair of parts joined together in the partition $\pi_{s-1}$.

Proof. First of all, all the inequalities of Proposition 3.2 are satisfied with equality by $\mathbf{1}^s$,
so that $\mathbf{1}^s \in V(C)$, except for the single inequality $d(\lambda^s_{j(s)}, \lambda^s_{k(s)}) \geq 0$, which is satisfied
strictly. Hence $\mathbf{1}^s$ lies in the intersection of all the facet-defining hyperplanes
except for one. Since V(C) is a pointed cone because it is contained in the positive orthant,
this implies that $\mathbf{1}^s$ is an extreme ray. This proves part (1). Furthermore, since every
extreme ray of a cone is the intersection of some of its facet-defining hyperplanes, every
other extreme ray must have the inequality $d(\lambda^s_{j(s)}, \lambda^s_{k(s)}) \geq 0$ as an active inequality. This
proves part (2). □



Note that Lemma 3.7 implies that if $s > t$, the vertex figure of V(C) is a pyramid with
apex $\mathbf{1}^s$.

Let $C = \pi_s \lessdot \cdots \lessdot \pi_t$ be a partial chain, and $C'$ a partial chain obtained as a final
segment of C, that is, there is a $u$ with $s \geq u \geq t$ such that $C' = \pi_u \lessdot \cdots \lessdot \pi_t$. The UPGMA
algorithm induces a natural linear map $\Lambda(C, C') : \mathbb{R}^{s(s-1)/2} \to \mathbb{R}^{u(u-1)/2}$. In particular, it
is defined by

$$(\Lambda(C, C')d)(\lambda, \lambda') = \frac{1}{|\lambda||\lambda'|} \sum_{\mu \subseteq \lambda,\ \mu' \subseteq \lambda',\ \mu, \mu' \in \pi_s} |\mu||\mu'|\, d(\mu, \mu')$$

where $\lambda, \lambda'$ are parts of $\pi_u$. Note, in particular, that the quantity $d(\mu, \mu')$ only appears in
the formula for $(\Lambda(C, C')d)(\lambda, \lambda')$, so that $\Lambda(C, C')$ is a coordinate substitution map
(Definition 3.9) when restricted to the coordinates $d(\mu, \mu')$ where $\mu, \mu'$ are contained in different
parts of $\pi_u$.

With the preceding paragraph in mind, we let $V'(C)$ denote the intersection of $V(C)$
with the hyperplane $\{d : d(\lambda^s_{j(s)}, \lambda^s_{k(s)}) = 0\}$.

Proposition 3.8. Let $C = \pi_s \lessdot \cdots \lessdot \pi_t$ be a partial chain with final segment
$C' = \pi_{s-1} \lessdot \cdots \lessdot \pi_t$. Then $\Lambda(C, C') : V'(C) \to V(C')$ is surjective, and
$V'(C) = \Lambda(C, C')^{-1}(V(C')) \cap \mathbb{R}^{s(s-1)/2 - 1}_{\geq 0}$.

Proof. Note that by definition of the UPGMA algorithm, the map $\Lambda(C, C') : V(C) \to V(C')$
is surjective. If a vector $d^s \in V(C)$, then so is the vector

$$d' = d^s - d^s(\lambda^s_{j(s)}, \lambda^s_{k(s)})\, e(\lambda^s_{j(s)}, \lambda^s_{k(s)}),$$

obtained by zeroing out the $(\lambda^s_{j(s)}, \lambda^s_{k(s)})$ coordinate. However, $\Lambda(C, C')d^s = \Lambda(C, C')d'$,
which implies that $\Lambda(C, C') : V'(C) \to V(C')$ is surjective.

To see that $V'(C) = \Lambda(C, C')^{-1}(V(C')) \cap \mathbb{R}^{s(s-1)/2 - 1}_{\geq 0}$, note that the inequalities that
describe $V'(C)$ are precisely the pullbacks of the inequalities that describe $V(C')$, plus
nonnegativity constraints, since none of the inequalities on V(C) coming from the covering
relation $\pi_s \lessdot \pi_{s-1}$ are needed. □

Definition 3.9. A linear transformation $\phi : \mathbb{R}^n \to \mathbb{R}^m$ is a coordinate substitution if for
each of the coordinate vectors $e_i$, $\phi(e_i) = c_i e_{\alpha(i)}$ with $c_i > 0$, where $\alpha : [n] \to [m]$. That
is, each coordinate maps to a scaled version of another coordinate.

Lemma 3.10. Let $D \subseteq \mathbb{R}^m$ be a polyhedral cone, $\phi : \mathbb{R}^n \to \mathbb{R}^m$ be a coordinate substitution
with associated map $\alpha$, and $C \subseteq \mathbb{R}^n$ a polyhedral cone such that $\phi(C) = D$. Suppose that
$C = \mathbb{R}^n_{\geq 0} \cap \phi^{-1}(D)$. Let $\mathcal{V}$ be the set of extreme rays of D. Then the extreme rays of C consist
of all vectors obtained by the following procedure:

For each extreme ray $\sum_j a_j e_j \in \mathcal{V}$, consider all vectors of the form $\sum_j a_j \frac{1}{c_{\beta(j)}} e_{\beta(j)}$,
ranging over all functions $\beta : [m] \to [n]$ such that $\alpha(\beta(j)) = j$ for all $j$.

Proof. It suffices to show that under the hypotheses of the Lemma, every extreme ray of
C maps onto an extreme ray of D. Indeed, if that is the case, the extreme rays of C are
precisely the vertices of the polytopes $\phi^{-1}(v) \cap \mathbb{R}^n_{\geq 0}$ as $v$ ranges over the extreme rays in
$\mathcal{V}$. Note that since $\phi$ is a coordinate substitution, $\phi^{-1}(v) \cap \mathbb{R}^n_{\geq 0}$ is isomorphic to a product of
simplices, the simplices being defined over coordinate subsets of the form $\alpha^{-1}(j)$. The
vertices of these products of simplices have the form of the statement of the Lemma.

Hence, it suffices to show the claim that every extreme ray of C maps onto an extreme
ray of D. So suppose that $v'$ is an extreme ray of C such that $\phi(v') = v$ is not an extreme
ray of D. Then there exist $w, u \in D$, not equal to $v$, such that $v = w + u$. Using these
vectors, we construct $w', u' \in C$ not equal to $v'$ such that $v' = w' + u'$. For each $i$ such
that $\alpha(i) = j$ define

$$w'_i = \frac{w_j}{v_j} v'_i \quad \text{and} \quad u'_i = \frac{u_j}{v_j} v'_i$$

(with $w'_i = u'_i = 0$ when $v_j = 0$). Clearly with this choice, we have $v' = w' + u'$ since $v_j = w_j + u_j$, and both $w'$ and $u'$
are nonnegative vectors. Also, since $w, u$ are not equal to $v$, neither are $w', u'$ equal to $v'$.
So we must show that $\phi(w') = w$ and $\phi(u') = u$. But

$$\phi(w')_j = \sum_{i : \alpha(i) = j} c_i w'_i = \sum_{i : \alpha(i) = j} c_i \frac{w_j}{v_j} v'_i = \frac{w_j}{v_j} \sum_{i : \alpha(i) = j} c_i v'_i = \frac{w_j}{v_j} v_j = w_j.$$

Similarly for $u'$, which completes the proof. □



We now have all the ingredients to prove Theorem 3.6.

Proof of Theorem 3.6. Let $C = \pi_s \lessdot \cdots \lessdot \pi_t$. First of all, note that if $s = t$, then V(C)
is the positive orthant in $\mathbb{R}^{s(s-1)/2}$, whose extreme rays are the standard unit vectors.

Now assume that $s > t$. According to Lemma 3.7, the vector $\mathbf{1}^s$ is an extreme ray of
V(C). Letting $C' = \pi_{s-1} \lessdot \cdots \lessdot \pi_t$, by Proposition 3.8 we see that all other extreme rays of
V(C) can be obtained by applying Lemma 3.10 to the extreme rays of $V(C')$. Repeating
this procedure for the extreme rays of $V(C')$ other than $\mathbf{1}^{s-1} \in V(C')$, we see
that every extreme ray of V(C) besides $\mathbf{1}^s$ can be obtained as a vertex of $\Lambda(C, C_u)^{-1}(\mathbf{1}^u)$,
where $C_u = \pi_u \lessdot \cdots \lessdot \pi_t$, plus the vertices of $\Lambda(C, C_t)^{-1}(e(\lambda_k, \lambda_l))$.

To complete the proof of the theorem we must analyze the vertices of $\Lambda(C, C_t)^{-1}(e(\lambda_k, \lambda_l))$
and show that the vertices of $\Lambda(C, C_u)^{-1}(\mathbf{1}^u)$ are precisely the induced vectors from the
traversals of $\pi_u$. For both of these statements, we can use Lemma 3.10.
Indeed, $\Lambda(C, C_u)$ is the map such that

$$(\Lambda(C, C_u)d)(\lambda, \lambda') = \frac{1}{|\lambda||\lambda'|} \sum_{x \in \lambda,\ y \in \lambda'} d(x, y).$$

This implies, by Lemma 3.10, that the vertices of

$$\Lambda(C, C_t)^{-1}(e(\lambda, \lambda'))$$

are $|\lambda| \cdot |\lambda'|\, e(k, l)$ such that $k \in \lambda$ and $l \in \lambda'$. Since we can ignore the scaling factor $|\lambda| \cdot |\lambda'|$
when describing extreme rays, taking the union over all pairs $\lambda, \lambda' \in \pi_t$ yields the set of
rays $\{e(k, l) : k, l \text{ are not in the same part of the partition } \pi_t\}$ from Theorem 3.6.

Similarly, applying Lemma 3.10 to the map $\Lambda(C, C_u)$ and the vector $\mathbf{1}^u$ yields the set
of induced vectors $v(F)$ associated to the partition $\pi_u$. Indeed, the coordinate 1 in the
$(\lambda, \lambda')$ position of $\mathbf{1}^u$ produces an entry of $|\lambda| \cdot |\lambda'|$ in exactly one of the positions $d(x, y)$
such that $x \in \lambda$, $y \in \lambda'$. This completes the proof of Theorem 3.6. □



We now show that Theorem 3.6 implies that the UPGMA cones have exponentially
many extreme rays.

Proposition 3.11. The cones V(C) have exponentially many extreme rays.

Proof. Given $\pi_s = \lambda^s_1|\cdots|\lambda^s_s$, the number of traversals is the product of the pairwise
products of the cardinalities of the blocks of $\pi_s$. So the number of extreme rays induced
by $\pi_s$ is

$$\prod_{\{i,j\} \in \binom{[s]}{2}} |\lambda^s_i||\lambda^s_j| = \prod_{i=1}^{s} |\lambda^s_i|^{s-1}.$$

Given a maximal chain C in $\Pi_n$, the total number of extreme rays will be

$$\sum_{s=2}^{n} \prod_{i=1}^{s} |\lambda^s_i|^{s-1},$$

which is exponential. □
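
The count in this proof is simple to evaluate for a concrete chain. In the product of pairwise products of block cardinalities, each block of a partition with s parts appears in s − 1 pairs, so the product equals the product of the block sizes raised to the power s − 1. The sketch below uses that observation; the list-of-sets encoding of a chain and the function names are assumptions made for illustration.

```python
def rays_from_partition(partition):
    """Number of traversals of a partition: product of |block|^(s-1) over its s blocks."""
    s = len(partition)
    count = 1
    for block in partition:
        count *= len(block) ** (s - 1)
    return count

def total_extreme_rays(chain):
    """Sum the traversal counts over the partitions of a maximal chain with at least 2 parts."""
    return sum(rays_from_partition(p) for p in chain if len(p) >= 2)

comb_chain = [[{1}, {2}, {3}, {4}], [{1, 2}, {3}, {4}], [{1, 2, 3}, {4}], [{1, 2, 3, 4}]]
balanced_chain = [[{1}, {2}, {3}, {4}], [{1, 2}, {3}, {4}], [{1, 2}, {3, 4}], [{1, 2, 3, 4}]]
print(total_extreme_rays(comb_chain), total_extreme_rays(balanced_chain))
```

For n = 4 the comb chain contributes 1 + 4 + 3 rays and the balanced chain 1 + 4 + 4, so the comb chain has the smaller count.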



Note that Propositions 3.2, 3.3, 3.11 and Theorem 3.6 yield Theorem 3.1.



4. Applications of Theorem 3.6

We use the characterization of the extreme rays of the cones V(C) to provide easy
geometric applications. First, the set V(T) of all dissimilarity maps for
which UPGMA returns a given tree is not convex in general. Second, the partition
of the positive orthant into the cones V(C) does not have the structure of a polyhedral fan,
which means the cones do not intersect along their boundaries in an especially nice way. Third,
we show that the comb tree topology minimizes the number of extreme rays in a UPGMA cone.



Corollary 4.1. The UPGMA regions V(T) are not convex in general.

Proof. We give an example for n = 4. Let T = ((12)(34)). Then $V(T) = V(C_1) \cup V(C_2)$,
where

$$C_1 = 1|2|3|4 \lessdot 3|4|12 \lessdot 12|34 \lessdot 1234$$

$$C_2 = 1|2|3|4 \lessdot 1|2|34 \lessdot 34|12 \lessdot 1234.$$

Now $v_1 = (0, 0, 2, 2, 0, 1)$ is an extreme ray of $V(C_1)$ induced by a traversal of 3|4|12 and
$v_2 = (1, 0, 2, 2, 0, 0)$ is an extreme ray of $V(C_2)$ induced by a traversal of 1|2|34. Let $d$ be
the convex combination

$$d = \tfrac{1}{2} v_1 + \tfrac{1}{2} v_2 = \left(\tfrac{1}{2}, 0, 2, 2, 0, \tfrac{1}{2}\right).$$

If $d$ is input into UPGMA, the algorithm will return a tree with either (1,3) or (2,4)
as a cherry, so $d$ is not in V(T). So, in general, UPGMA regions are not convex unless
V(T) = V(C) for a single chain C in $\Pi_n$. □
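
The midpoint in this proof can be checked directly. The sketch below forms the convex combination with exact rational arithmetic and lists the pairs attaining the minimum, which are the only candidates for the first merge; coordinates are assumed to be ordered 12, 13, 14, 23, 24, 34.

```python
from fractions import Fraction

pairs = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
v1 = (0, 0, 2, 2, 0, 1)   # extreme ray from a traversal of 3|4|12
v2 = (1, 0, 2, 2, 0, 0)   # extreme ray from a traversal of 1|2|34

# Midpoint of v1 and v2.
d = [Fraction(a + b, 2) for a, b in zip(v1, v2)]

# UPGMA must begin by merging one of the pairs of minimal dissimilarity.
smallest = min(d)
first_merge_candidates = [p for p, value in zip(pairs, d) if value == smallest]
print(d, first_merge_candidates)
```

Only the pairs (1,3) and (2,4) attain the minimum, so neither (1,2) nor (3,4) can be the first cherry, which is what the argument requires.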

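A fan is a family $\mathcal{F}$ of cones in $\mathbb{R}^n$ such that

(1) if $P \in \mathcal{F}$ then every nonempty face of $P$ is in $\mathcal{F}$;

(2) if $P_1, P_2 \in \mathcal{F}$ then $P_1 \cap P_2$ is a face of both $P_1$ and $P_2$ (and hence lies in $\mathcal{F}$).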

Corollary 4.2. The UPGMA cones do not partition $\mathbb{R}^{n(n-1)/2}_{\geq 0}$ into a fan.

Proof. Consider the two chains in $\Pi_4$

$$C_1 = 1|2|3|4 \lessdot 3|4|12 \lessdot 4|123 \lessdot 1234$$

$$C_2 = 1|2|3|4 \lessdot 2|4|13 \lessdot 4|123 \lessdot 1234.$$

The vector (0, 0, 0, 1, 1, 1) generates an extreme ray of $V(C_1) \cap V(C_2)$. If $V(C_1) \cap V(C_2)$
were a face of $V(C_1)$ and $V(C_2)$, then (0, 0, 0, 1, 1, 1) would generate an extreme ray of $V(C_1)$ and
$V(C_2)$. However, by Theorem 3.6, extreme rays of $V(C_1)$ and $V(C_2)$ must correspond to
partitions in $\Pi_4$. Only partitions with 3 blocks induce vectors with 3 nonzero coordinates,
and no partition of the set [4] has 3 blocks of equal cardinality. So no traversal of a
partition in $\Pi_4$ induces a multiple of (0, 0, 0, 1, 1, 1). Therefore the UPGMA cones are not
a fan. □

Corollary 4.3. For each n, the comb tree topology minimizes the number of extreme rays
over all UPGMA cones in $\mathbb{R}^{\binom{n}{2}}$.

Proof. Fix n. We will show that for each $1 \leq s \leq n$, the partitions whose parts have
cardinalities $1, 1, \ldots, 1, n-s+1$ minimize the number of traversals for all partitions with
s parts. For all integers $x, y > 0$, we have $xy \geq (x + y - 1)(1)$. So for $\pi_s = \lambda^s_1|\cdots|\lambda^s_s$, the
number of extreme rays induced by $\pi_s$ satisfies

$$\prod_{\{i,j\} \in \binom{[s]}{2}} |\lambda^s_i||\lambda^s_j| \;\geq\; \prod_{\{i,j\} \in \binom{[s]}{2}} \left(|\lambda^s_i| + |\lambda^s_j| - 1\right).$$

The only type of partition in $\Pi_n$ with s parts such that all pairs $\{i, j\} \in \binom{[s]}{2}$ satisfy either
$|\lambda^s_i| = 1$ or $|\lambda^s_j| = 1$ is the type with $s - 1$ singleton parts and one part of size $n - s + 1$.
Therefore partitions of this type minimize the number of associated induced vectors.

If C is a maximal chain in $\Pi_n$ such that every $\pi_s$ in C is of this type, then the tree
returned by $d \in V(C)$ has the comb tree topology. Therefore this tree topology minimizes
the number of extreme rays for the cone V(C). □
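
The minimization claim can be confirmed by brute force for small n, since the number of traversals depends only on the block sizes of a partition. The sketch below enumerates all block-size profiles with s parts and reports which ones minimize the traversal count; the function names are hypothetical, and the search is an illustration rather than part of the proof.

```python
def size_profiles(n, s, largest=None):
    """Yield non-increasing tuples of s positive integers summing to n."""
    if largest is None:
        largest = n
    if s == 0:
        if n == 0:
            yield ()
        return
    for first in range(min(n - (s - 1), largest), 0, -1):
        for rest in size_profiles(n - first, s - 1, first):
            yield (first,) + rest

def traversal_count(profile):
    """Number of traversals of a partition with the given block sizes."""
    s = len(profile)
    count = 1
    for size in profile:
        count *= size ** (s - 1)
    return count

def minimizing_profiles(n, s):
    """All block-size profiles with s parts attaining the minimum traversal count."""
    profiles = list(size_profiles(n, s))
    best = min(traversal_count(p) for p in profiles)
    return [p for p in profiles if traversal_count(p) == best]

for s in range(1, 8):
    print(s, minimizing_profiles(7, s))
```

For n = 7 the unique minimizer for every s is the profile with one block of size n − s + 1 and s − 1 singletons, in line with the corollary.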

5. Spherical Volumes of UPGMA Regions 

A natural way to measure the relative proportion of the positive orthant occupied by the
region V(T) of dissimilarity maps returning the tree T is to calculate the $\binom{n}{2} - 1$



POLYHEDRAL COMBINATORICS OF UPGMA CONES 






dimensional measure of the surface arising as the intersection of the cones $V(C) \subseteq V(T)$
with the unit sphere $S$ in $\mathbb{R}^{n(n-1)/2}$. We refer to this measure as spherical volume.

We estimated the spherical volume of UPGMA cones in two ways using Mathematica,
polymake [6], and the software [7]. For the first method, we sampled points from the
positive orthant using a spherical distribution and input the samples into UPGMA, recording
which tree the algorithm returned on each input point. The volume of V(T) is then estimated
from the fraction of the total sample points returning T. We calculated volumes for n = 4, 5, 6, 7
using this method.

For the second method, we used a Monte Carlo strategy to estimate the surface area of
the cones. For n = 4, 5, 6, we used the software [7]. This software requires
as input triangulations of point configurations, which we computed using polymake [6]. For
n = 7, some triangulations for maximal chains in $\Pi_7$ were too large to compute and use.
We used Mathematica to implement a modification of the sampling strategy employed in
[3] along with the UPGMA algorithm.

The basic strategy using Monte Carlo integration to compute spherical volumes can be
described as follows. Given a simplicial cone cone(V) spanned by vectors $V = v_1, \ldots, v_n$,
it is easy to generate uniform samples from the simplex conv(V). The map that takes a
point $x \in \mathrm{conv}(V)$ onto $\mathrm{cone}(V) \cap S$ is simply $x \mapsto x/\|x\|_2$. The spherical volume is then
the average value of the Jacobian of this map. To calculate the spherical volume of a cone
V(C) of a full chain in situations where we could only compute a triangulation of a cone
$V(C')$ from a partial chain $C'$, we generate random points from the partial cone $V(C')$ and
compute the average of the product of the Jacobian and the indicator function of lying in the
cone V(C).
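
The first, sampling-based method is easy to imitate for n = 4. The following is a rough sketch, not the authors' Mathematica code: it assumes that absolute values of independent Gaussians give a spherically symmetric distribution on the positive orthant (normalizing to the unit sphere is skipped because UPGMA is invariant under positive scaling), and the function names, sample size and seed are arbitrary illustrative choices.

```python
import random
from itertools import combinations

def upgma_block_sizes(n, vec):
    """Run UPGMA and record the sorted block sizes of the partition after each merge."""
    blocks = [frozenset([x]) for x in range(n)]
    dist = {frozenset((frozenset([x]), frozenset([y]))): value
            for (x, y), value in zip(combinations(range(n), 2), vec)}
    history = []
    while len(blocks) > 1:
        a, b = min(combinations(blocks, 2), key=lambda pair: dist[frozenset(pair)])
        merged = a | b
        rest = [c for c in blocks if c != a and c != b]
        for c in rest:
            dist[frozenset((c, merged))] = (
                len(a) * dist[frozenset((c, a))] + len(b) * dist[frozenset((c, b))]
            ) / len(merged)
        blocks = rest + [merged]
        history.append(sorted(len(c) for c in blocks))
    return history

def estimate_balanced_fraction(samples, seed=0):
    """Fraction of sampled dissimilarity maps on 4 taxa yielding the balanced shape."""
    rng = random.Random(seed)
    balanced = 0
    for _ in range(samples):
        vec = [abs(rng.gauss(0.0, 1.0)) for _ in range(6)]
        # The tree is balanced exactly when the two-block partition has sizes 2 and 2.
        if upgma_block_sizes(4, vec)[1] == [2, 2]:
            balanced += 1
    return balanced / samples

print(estimate_balanced_fraction(4000))
```

The estimate counts all labelings of the balanced shape together, so it is comparable to a Fraction of Orthant entry rather than to the volume of a single region V(T).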

We summarize here the results of those computations for trees with n = 4, 5, 6, 7 leaves, only
displaying results for the regions V(T). In the tables below, we give estimates of the
spherical volumes of the regions V(T). The column Tree gives the tree in Newick format.
The column # Chains refers to the number of cones producing the given tree. The Volume
column gives the total volume of all of the cones associated to the given tree, and the
Fraction of Orthant column gives the portion of the positive orthant in $\mathbb{R}^{n(n-1)/2}$ that returns
the given tree topology under UPGMA.

Recall that $V(T) = \bigcup V(C)$, where C ranges over the chains in $\Pi_n$ corresponding to T.
So, the number of cones associated to a tree T depends on the number of rank functions
that T admits. For example, in the table for n = 5, the tree $T_2$ = (((12)3)(45)) has
4!/(4 · 2 · 1 · 1) = 3 rank functions, and there are 3 cones in $V(T_2)$.

A more detailed explanation of the volume computations, as well as software and input 
files, is available at [2]. 





n = 4

      Tree                   # Chains   Volume            Fraction of Orthant
 1    (((12)3)4)             1          0.0238            0.5895
 2    ((12)(34))             2          0.0662            0.4099

n = 5

      Tree                   # Chains   Volume            Fraction of Orthant
 1    ((((12)3)4)5)          1          8.57 x 10^-5      0.206
 2    (((12)3)(45))          3          5.01 x 10^-4      0.604
 3    (((12)(34))5)          2          3.14 x 10^-4      0.189

n = 6

      Tree                   # Chains   Volume            Fraction of Orthant
 1    (((((12)3)4)5)6)       1          2.05 x 10^-8      0.042
 2    ((((12)3)4)(56))       4          2.10 x 10^-7      0.216
 3    ((((12)3)(45))6)       3          2.16 x 10^-7      0.223
 4    (((12)3)((45)6))       6          4.5 x 10^-7       0.229
 5    ((((12)(34))5)6)       2          1.05 x 10^-7      0.054
 6    (((12)(34))(56))       8          9.06 x 10^-7      0.231

n = 7

      Tree                   # Chains   Volume            Fraction of Orthant
 1    ((((((12)3)4)5)6)7)    1          2.75 x 10^-13     0.0050
 2    (((((12)3)4)5)(67))    5          4.82 x 10^-12     0.0435
 3    (((((12)3)4)(56))7)    4          6.32 x 10^-12     0.0570
 4    ((((12)3)4)((56)7))    10         1.95 x 10^-11     0.1762
 5    (((((12)3)(45))6)7)    3          4.45 x 10^-12     0.0402
 6    ((((12)3)(45))(67))    15         5.72 x 10^-11     0.2581
 7    ((((12)3)((45)6))7)    6          1.66 x 10^-11     0.0747
 8    (((12)3)((45)(67)))    20         9.00 x 10^-11     0.2030
 9    (((((12)(34))5)6)7)    2          1.73 x 10^-12     0.0078
10    ((((12)(34))5)(67))    10         2.63 x 10^-11     0.0593
11    ((((12)(34))(56))7)    8          3.33 x 10^-11     0.0753

The computations suggest some observations which might hold true for large n. As we
have shown in Corollary 4.3, the cone associated to the single rank function on the comb
tree is the cone V(C) with the fewest extreme rays. Our computations
up to n = 7 suggest that this is also the cone with the smallest spherical volume. See
[2] for those values. The size of the region V(T) appears to be roughly proportional to
the number of chains C that yield the tree T and appears to be smallest for the comb
tree. Furthermore, the relative proportion of the positive orthant taken up by the comb
tree topology appears to be the smallest. We predict that these patterns hold for larger
numbers of taxa as well.